This report explores a dataset containing chemical information and the quality score of different labels of white wine.

Dataset Summary

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The dataset consists of 13 variables, with 4898 observations.

Univariate Plots Section


Distribution of Quality Scores

The quality scores of the wines appear to be approximately normally distributed. Very few wines are rated 3 (very poor quality) or 9 (excellent quality). The mean (red line), median (blue line), and mode of the quality ratings all fall near a score of 6. Based on the information given in the dataset, I wonder which factors can effectively predict the quality score of white wine.

Distribution of Independent Variables

We can see that most of the independent variables are approximately normally distributed, except for residual sugar, which has a long tail. We’ll apply a log transformation to get a better representation of its distribution. Also of interest are the two SO2 content variables: we can analyze the proportion of SO2 that is free later on.

whiteWine$prop_free.sulfur.dioxide <- whiteWine$free.sulfur.dioxide / whiteWine$total.sulfur.dioxide

The log-transformed residual.sugar distribution appears bimodal, with peaks at around 1.1-1.6 and 8.0 g/dm^3.
Chloride levels don’t seem to differ much across the wines in the dataset.
The proportion of free sulfur dioxide is approximately normally distributed, with a peak at around 25-28%.
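
As a small illustration of the two derived quantities discussed here, the sketch below rebuilds them on a toy data frame. The four rows are the first four wines as printed in the dataset summary (so the values are rounded), and the column name `log10_residual.sugar` is a hypothetical name chosen for this example, not one used in the report.

```r
# First four wines as printed in the dataset summary (rounded values).
toy <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Share of total SO2 that is free, using the same formula as the report.
toy$prop_free.sulfur.dioxide <-
  toy$free.sulfur.dioxide / toy$total.sulfur.dioxide

# log10 compresses the long right tail of residual sugar.
# (The column name is an assumption made for this sketch.)
toy$log10_residual.sugar <- log10(toy$residual.sugar)

print(round(toy, 3))
```

Because free SO2 is part of total SO2, every proportion should lie between 0 and 1, which is a quick sanity check on the new variable.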

Univariate Analysis


1. What is the structure of the dataset?

There are 4,898 different labels of white wine with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). All the variables are continuous except quality, which is an integer score. The 13th column, X, is a row index.
The quality rating is on a scale of 0 (very bad) to 10 (very good). Wines in the current dataset only cover ratings of 3-9.

Other observations:
* Most white wines in the dataset have very little residual sugar (around 1 g per cubic decimeter).
* Most wines contain similar amounts of salt (sodium chloride), with the distribution peaking at 0.03-0.06 g per cubic decimeter.

2. What is/are the main features of interest of the dataset?

The main feature of interest is the quality ratings. We look to investigate which chemicals influence the quality rating of white wines.

3. What other features in the dataset do you think will help support your investigation into your features of interest?

Both residual.sugar and alcohol levels have interesting distributions. There may also be interrelationships between some of the variables.

4. Did you create any new variables from existing data?

Yes, I divided the free sulfur dioxide level by the total sulfur dioxide level to obtain the proportion of free sulfur dioxide.

5. Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As the quality ratings are only integers from 3-9, I changed it into an ordered factor.
Residual sugar had a long-tailed distribution, so a log transformation was performed.
There may be outliers in the dataset, but I kept them for further studies of the best and worst wines.
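
The conversion of quality into an ordered factor could look roughly like the sketch below. The ratings are hypothetical values made up for illustration, and the name `quality.int` is borrowed from the model formulas later in the report; the exact code used in the report is not shown, so this is an assumption.

```r
# Hypothetical integer ratings; the real data covers scores 3-9.
quality.int <- c(6, 5, 7, 3, 9, 6, 8, 4)

# Declare all seven possible scores as levels so that the ordering
# 3 < 4 < ... < 9 is preserved even if a score is absent from a subset.
quality <- factor(quality.int, levels = 3:9, ordered = TRUE)

print(is.ordered(quality))
print(table(quality))

# Ordered factors support comparisons between levels.
print(quality[4] < quality[5])  # compares the rating 3 with the rating 9
```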

Bivariate Plots Section


The four highest correlation coefficients (in absolute value) of variables with quality:

quality~alcohol

cor(as.numeric(whiteWine$quality), whiteWine$alcohol)
## [1] 0.4355747

quality~chlorides

## [1] -0.2099344

quality~density

## [1] -0.3071233

quality~prop_free.sulfur.dioxide

## [1] 0.1972141

Some observations:
* Correlation coefficients between quality and the remaining variables are not displayed. However, boxplots of quality ~ alcohol and quality ~ density show interesting patterns that are worth further investigation.
* As the selected correlation coefficients have shown, quality of wine cannot be sufficiently predicted by any chemical content alone.
* Density and residual sugar seem to be positively correlated (r = 0.839).
* Chlorides and alcohol are slightly negatively correlated (r = -0.36).
* Density and alcohol are strongly negatively correlated (r = -0.78).
We’ll take a further look into these variables.
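
One detail worth spelling out: once quality is an ordered factor, `as.numeric()` returns the level codes (1 to 7) rather than the ratings (3 to 9). The sketch below, using hypothetical ratings and alcohol values, suggests this does not change the Pearson correlation as long as the levels are the consecutive scores 3 to 9, because the codes are just the ratings shifted by a constant.

```r
# Hypothetical values, for illustration only.
quality.int <- c(5, 6, 6, 7, 5, 8, 6, 4)
alcohol     <- c(9.4, 10.1, 9.8, 11.5, 9.0, 12.4, 10.8, 9.2)

quality <- factor(quality.int, levels = 3:9, ordered = TRUE)

# Level codes are ratings minus 2 when the levels are 3:9.
codes <- as.numeric(quality)

r_codes   <- cor(codes, alcohol)
r_ratings <- cor(quality.int, alcohol)

# Pearson correlation is invariant to adding a constant.
print(all.equal(r_codes, r_ratings))
```

If the factor levels were not consecutive integers, the codes and the ratings would no longer differ by a constant, and converting via `as.numeric(as.character(quality))` would be the safer route.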

Scatterplot of Quality And Selected Variables

From the plots above we can see patterns of slight correlations between quality and alcohol, chlorides, density, and free SO2 proportion.
More specifically, the quality rating increases as alcohol content increases. The same pattern applies to free SO2 proportion levels.
In contrast, the higher a wine’s density, the lower its quality rating tends to be.
In terms of chlorides, there is no obvious pattern due to too many extreme values.

Alcohol Content

We can see from the stacked histogram above that low-quality wines gather at the left side of the graph, while higher-quality ones gather on the right.
The relationship is not easy to see as there are too many levels of quality. Next we’ll regroup ratings into three buckets for a better color visualization.

## 
##  (2,5]  (5,7] (7,10] 
##   1640   3078    180
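
The bucket labels above match what `cut()` produces with break points at 2, 5, 7, and 10. The sketch below shows that mapping on the seven scores present in the data; whether the report used exactly these arguments is an assumption.

```r
# Every quality score that occurs in the dataset.
quality.int <- 3:9

# Intervals are right-closed by default: (2,5], (5,7], (7,10].
quality.bucket <- cut(quality.int,
                      breaks = c(2, 5, 7, 10),
                      ordered_result = TRUE)

print(data.frame(quality.int, quality.bucket))
```

Scores 3-5 land in the low bucket, 6-7 in the middle bucket, and 8-9 in the high bucket.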

Conditional means/medians of alcohol content among three quality groups:

## # A tibble: 3 × 4
##   quality.bucket alcohol_mean alcohol_median     n
##            <ord>        <dbl>          <dbl> <int>
## 1          (2,5]      9.84953            9.6  1640
## 2          (5,7]     10.80197           10.8  3078
## 3         (7,10]     11.65111           12.0   180

High-quality wines clearly tend to have higher alcohol content, and poorer-quality wines lower alcohol content: mean alcohol rises from 9.85% in the lowest bucket to 10.80% in the middle bucket and 11.65% in the highest.
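
A summary table of this shape can be sketched in base R as follows. The tibble output above suggests the report used a different toolchain, so this is an equivalent illustration rather than the original code, and the data values are hypothetical.

```r
# Hypothetical wines, for illustration only.
toy <- data.frame(
  quality.int = c(4, 5, 5, 6, 6, 7, 8, 9),
  alcohol     = c(9.2, 9.6, 9.9, 10.4, 10.9, 11.2, 11.8, 12.5)
)
toy$quality.bucket <- cut(toy$quality.int,
                          breaks = c(2, 5, 7, 10),
                          ordered_result = TRUE)

# Mean, median, and count of alcohol within each quality bucket.
summary_by_bucket <- data.frame(
  quality.bucket = levels(toy$quality.bucket),
  alcohol_mean   = as.vector(tapply(toy$alcohol, toy$quality.bucket, mean)),
  alcohol_median = as.vector(tapply(toy$alcohol, toy$quality.bucket, median)),
  n              = as.integer(table(toy$quality.bucket))
)

print(summary_by_bucket)
```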

Chlorides (Eliminated top 2% chloride levels)

After adjusting for the long tail, the difference among the three quality groups is still not easy to see. Although the difference in chloride level is minimal, higher chloride content tends to occur among lower-quality wines.
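
One way to drop the top 2% of chloride values is to filter on the 98th percentile, as sketched below on simulated data. The report does not show how the trimming was done (it could equally have been an axis limit on the plot), so both the method and the numbers here are assumptions.

```r
set.seed(1)

# Simulated chloride levels: 98 typical values plus two extreme ones.
chlorides <- c(rnorm(98, mean = 0.045, sd = 0.01), 0.20, 0.35)

# 98th percentile; values above it are treated as the long tail.
cutoff  <- quantile(chlorides, 0.98)
trimmed <- chlorides[chlorides <= cutoff]

print(length(chlorides))
print(length(trimmed))
print(range(trimmed))
```

Filtering on a quantile keeps the cutoff tied to the data rather than to a hand-picked threshold, at the cost of always discarding a fixed share of observations.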

Density (Eliminated top 0.5% density levels)

Conditional means/medians of density among three quality groups:

## # A tibble: 3 × 4
##   quality.bucket density_mean density_median     n
##            <ord>        <dbl>          <dbl> <int>
## 1          (2,5]    0.9951600        0.99514  1640
## 2          (5,7]    0.9935299        0.99305  3078
## 3         (7,10]    0.9922144        0.99162   180

Although the conditional means and medians differ only slightly among the groups, the density plot shows that, within the narrow range of density values, poorer-quality wines tend to have higher density and better-quality wines tend to have lower density.

Free SO2 Proportion

Conditional means/medians comparison among three quality groups:

## # A tibble: 3 × 4
##   quality.bucket  SO2_mean SO2_median     n
##            <ord>     <dbl>      <dbl> <int>
## 1          (2,5] 0.2322617  0.2310096  1640
## 2          (5,7] 0.2660284  0.2626691  3078
## 3         (7,10] 0.2892855  0.2876712   180

The density plot suggests that free SO2 proportion levels do not differ very much across the three quality buckets. The correlations we found from the scatterplot matrix may reflect covariance with other variables rather than a direct relationship. We’ll plot some of these variables together to further see their interrelationships.

Bivariate Analysis


1. Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Of all the features, alcohol content correlates most strongly with quality (r = 0.44).
The alcohol distributions of the lower- and higher-quality wines peak in different ranges. High-quality wines tend to have higher alcohol content (11-13%) and low-quality wines tend to have lower alcohol content (8.5-10%). Medium-quality wines are spread across roughly 9-13% alcohol.

Density also appears to be related to quality ratings (r = -0.31), though less strongly than alcohol. The interactions between some of these factors may be important to look into further when we try to predict the quality ratings of wines.

2. Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It seems that the density of wine is correlated with many other features. It is highly correlated with residual sugar: wines with more residual sugar tend to be denser. Density is also strongly negatively correlated with alcohol content (r = -0.78): wines with less alcohol tend to be denser.

Other interesting correlations are found between chlorides and alcohol, and between residual sugar and alcohol. The correlation coefficients are around 0.35-0.4 in magnitude, but it’s hard to see the relationships in the scatter plots. We’ll plot them together to see further interactions.

3. What was the strongest relationship you found?

Among the relationships involving quality, the strongest was with alcohol level (r = 0.44); overall, the strongest pairwise correlation was between density and residual sugar (r = 0.839). The relationship between quality and alcohol is even clearer when the wines are grouped into quality buckets.
As we’ve found other variables that may covary with alcohol level, we’ll investigate chlorides, residual sugar, and density further.

Multivariate Plots Section


Density and Residual sugar, colored by quality buckets

Density and residual sugar have a strong correlation of 0.839.

After adjusting the scales and eliminating the outlier, we can see almost three distinct upward lines for the different quality buckets. This plot suggests that density increases as residual sugar content increases. Poor-quality wines tend to have higher density, whereas medium- and high-quality wines tend to have lower density.

Density and alcohol, colored by quality buckets

Density and alcohol have a strong negative correlation of -0.78.

From the scatterplot above we can see that higher alcohol content is associated with lower density. Medium- and high-quality wines tend to have higher alcohol content and lower density, while lower-quality wines tend to have higher density and lower alcohol content.

Alcohol and Chlorides, colored by quality

Alcohol and Chlorides have a moderate correlation of -0.36.

Apart from the difference in alcohol content, some of the low-quality wines tend to have higher chloride levels. However, the difference in chlorides seems to be minimal.

Alcohol and Residual Sugar, colored by quality

Apart from the relationship between alcohol and quality buckets, the difference in residual sugar seems to be minimal.

Modeling Wine Quality Ratings

## 
## Calls:
## m1: lm(formula = quality.int ~ alcohol, data = whiteWine)
## m2: lm(formula = quality.int ~ alcohol + density, data = whiteWine)
## m3: lm(formula = quality.int ~ alcohol + density + residual.sugar, 
##     data = whiteWine)
## m4: lm(formula = quality.int ~ alcohol + density + residual.sugar + 
##     chlorides, data = whiteWine)
## m5: lm(formula = quality.int ~ alcohol + density + residual.sugar + 
##     chlorides + prop_free.sulfur.dioxide, data = whiteWine)
## 
## =======================================================================================
##                                m1          m2          m3          m4          m5      
## ---------------------------------------------------------------------------------------
##   (Intercept)                2.582***  -22.492***   90.313***   87.563***   56.573***  
##                             (0.098)     (6.165)    (12.374)    (12.392)    (12.518)    
##   alcohol                    0.313***    0.360***    0.246***    0.237***    0.262***  
##                             (0.009)     (0.015)     (0.018)     (0.018)     (0.018)    
##   density                               24.728***  -87.886***  -84.931***  -54.289***  
##                                         (6.079)    (12.317)    (12.340)    (12.461)    
##   residual.sugar                                     0.053***    0.052***    0.038***  
##                                                     (0.005)     (0.005)     (0.005)    
##   chlorides                                                     -1.776**    -1.861***  
##                                                                 (0.555)     (0.548)    
##   prop_free.sulfur.dioxide                                                   1.404***  
##                                                                             (0.122)    
## ---------------------------------------------------------------------------------------
##   R-squared                     0.190      0.192       0.210       0.212       0.233   
##   adj. R-squared                0.190      0.192       0.210       0.211       0.232   
##   sigma                         0.797      0.796       0.787       0.787       0.776   
##   F                          1146.395    583.290     434.085     328.736     296.836   
##   p                             0.000      0.000       0.000       0.000       0.000   
##   Log-likelihood            -5839.391  -5831.127   -5776.812   -5771.696   -5705.710   
##   Deviance                   3112.257   3101.773    3033.737    3027.406    2946.925   
##   AIC                       11684.782  11670.255   11563.624   11555.391   11425.420   
##   BIC                       11704.272  11696.241   11596.107   11594.371   11470.896   
##   N                          4898       4898        4898        4898        4898       
## =======================================================================================
## 
## Call:
## lm(formula = quality.int ~ alcohol + density + residual.sugar + 
##     chlorides + prop_free.sulfur.dioxide, data = whiteWine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5838 -0.5291 -0.0377  0.4774  3.1759 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               56.572824  12.518291   4.519 6.35e-06 ***
## alcohol                    0.261991   0.018333  14.290  < 2e-16 ***
## density                  -54.289385  12.461186  -4.357 1.35e-05 ***
## residual.sugar             0.037838   0.005185   7.298 3.40e-13 ***
## chlorides                 -1.861363   0.547903  -3.397 0.000686 ***
## prop_free.sulfur.dioxide   1.404420   0.121504  11.559  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7761 on 4892 degrees of freedom
## Multiple R-squared:  0.2328, Adjusted R-squared:  0.232 
## F-statistic: 296.8 on 5 and 4892 DF,  p-value: < 2.2e-16

Among the models above, model 5 captured the most variance (adj. R^2 = 0.232) in the dataset and has the lowest BIC among all. We’ll take a look at the distribution of residuals of model 5.
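
The stepwise comparison can be sketched as below: fit nested linear models and compare adjusted R^2 and BIC. The data here is simulated with a made-up relationship purely so the example runs on its own, which means the numbers it prints have nothing to do with the table above.

```r
set.seed(42)
n <- 200

# Simulated stand-ins for the real columns (hypothetical relationships).
alcohol     <- runif(n, min = 8, max = 14)
density     <- 1.0 - 0.001 * alcohol + rnorm(n, sd = 0.001)
quality.int <- 3 + 0.3 * alcohol + rnorm(n, sd = 0.8)
sim <- data.frame(quality.int, alcohol, density)

# Nested models: each step adds one predictor to the previous formula.
m1 <- lm(quality.int ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + density)

# Higher adjusted R^2 and lower BIC both favour a model.
comparison <- data.frame(
  model         = c("m1", "m2"),
  adj.r.squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  BIC           = c(BIC(m1), BIC(m2))
)

print(comparison)
```

Plain R^2 can never decrease when a predictor is added, which is why adjusted R^2 and BIC, both of which penalize extra terms, are the more informative criteria for this kind of comparison.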

We can see that most observations fall in the 5 and 6 quality categories, and their residuals lie mostly between -1 and 1. Since quality is scored in whole points, an error of up to 1 is understandable. Within that tolerance, the model does a decent job of predicting the quality score of wines, although it explains only about 23% of the variance.

Multivariate Analysis


Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol concentration seems to be the deciding factor for evaluating wine quality. Although other features also seem to influence the wine, their influence on wine quality is rather indirect, as they correlate more strongly with alcohol level instead of with quality ratings directly.

Were there any interesting or surprising interactions between features?

It’s interesting to see that the density of wine decreases as the alcohol content increases. More surprisingly, we found that the level of residual sugar also decreases as the alcohol content increases. It would be fascinating to learn more about the chemical and biological reactions taking place during wine production.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model, starting from alcohol content and then adding density, residual sugar, chlorides, and free SO2 proportion.
The variables in the final model account for only 23.2% of the variance in wine quality. Adding residual sugar, chlorides, and free SO2 proportion raised R^2 by about 4 percentage points (from 0.192 to 0.233), which is expected based on the correlations between features seen in the visualizations. Also, because taking log10 of residual sugar did not improve the goodness of fit, the feature was included in the model in its original form.

Final Plots and Summary


Plot One

Description One

Of all the features, alcohol level has the strongest correlation with quality rating (r = 0.44). High-quality wines tend to have higher alcohol content, and poorer-quality wines tend to have lower alcohol content.

Plot Two

Description Two

From the plot we can see that alcohol and density of wine are negatively correlated. Medium- and high-quality wines tend to have higher alcohol content and lower density, while lower-quality wines tend to have higher density and lower alcohol content.

Plot Three

Description Three

The fitted model is quality = 56.57 + 0.26(alcohol) - 54.29(density) + 0.04(residual.sugar) - 1.86(chlorides) + 1.40(free.SO2 / total.SO2), and its residuals are plotted above. Most of the error occurs at ratings of 5 and 6 and lies between -1 and 1, so we can say that the model does a decent job of describing the current dataset.
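
To make the fitted equation concrete, the sketch below applies the model 5 coefficients from the regression summary to the first wine in the dataset summary. The inputs are the values as printed there, so they are rounded and the result is only approximate.

```r
# Coefficients of model 5, copied from the regression summary.
coefs <- c(
  "(Intercept)"            = 56.572824,
  alcohol                  = 0.261991,
  density                  = -54.289385,
  residual.sugar           = 0.037838,
  chlorides                = -1.861363,
  prop_free.sulfur.dioxide = 1.404420
)

# First wine in the dataset summary (printed, hence rounded, values).
wine <- c(
  "(Intercept)"            = 1,
  alcohol                  = 8.8,
  density                  = 1.001,
  residual.sugar           = 20.7,
  chlorides                = 0.045,
  prop_free.sulfur.dioxide = 45 / 170
)

# Linear prediction: sum of coefficient times feature value.
predicted <- sum(coefs * wine[names(coefs)])

# Residual = observed rating (6 for this wine) minus prediction.
residual <- 6 - predicted

print(round(c(predicted = predicted, residual = residual), 2))
```

For this wine the prediction comes out at roughly 5.6 against an observed rating of 6, which is in line with most residuals falling between -1 and 1.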

Reflection


The white wine dataset contains information about 4,898 labels of wine. I started by understanding the individual variables, then explored the correlations between each pair of features and made some interesting observations.

There was a trend between the alcohol concentration of a wine and its quality. The trend became clearer when I regrouped the wines into three buckets by their quality score. Reducing the quality rating to three levels made it easier to visualize its relationship with other features of wine.

With all the information I found, I was able to create a linear model that captures the relationships between different features of wines to predict white wine quality. Although it captures only 23% of the variation in quality, the error is within an acceptable range. However, as the model includes only linear terms, further improvements could be made by exploring other possibilities.
